(57) Abstract

A surveillance system having at least one primary video camera (10) for translating real images of a zone into electronic video signals 
at a first level of resolution. The system includes means for sampling movements (12) of an individual or individuals located within the 
zone from the video signal output from at least one video camera (10 or 20). Video signals of sampled movements of the individual are 
electronically compared (14) with known characteristics of movements which are indicative of individuals having a criminal intent. The 
level of criminal intent of the individual or individuals is then determined and an appropriate alarm signal is produced.

Abnormality Detection and Surveillance System

Field of the Invention

This invention generally relates to surveillance 
systems, and more particularly, to trainable surveillance 
systems which detect and respond to specific
abnormal video and audio input signals.

Background of the Invention 

Today's surveillance systems vary in complexity,
efficiency and accuracy. Earlier surveillance systems used
several closed circuit cameras, each connected to a dedicated
monitor. This type of system works sufficiently well for
low-coverage sites, i.e., areas requiring up to perhaps six
cameras. In such a system, a single person can scan the six
monitors in "real" time and effectively monitor the entire
(albeit small) protected area, offering a relatively high level
of readiness to respond to an abnormal act or situation
observed within the protected area. In this simplest of
surveillance systems, it is left to the discretion of security
personnel to determine, first, whether any abnormal event is
in progress within the protected area; second, the level of
concern placed on that particular event; and third, what
actions should be taken in response to the particular event.
The reliability of the entire system depends on the alertness
and efficiency of the worker observing the monitors.

Many surveillance systems, however, require the use
of a greater number of cameras (e.g., more than six) to police
a larger area, such as every room located within a
large museum. To adequately ensure reliable and complete
surveillance within the protected area, either more personnel
must be employed to constantly watch the additionally required
monitors (one per camera), or fewer monitors may be used on a
simple rotation schedule wherein one monitor sequentially
displays the output images of several cameras, displaying the
images of each camera for perhaps a few seconds. In another
prior art surveillance system (referred to as the "QUAD"
system), four cameras are connected to a single monitor whose
screen continuously and simultaneously displays the four
different images. In a "quaded quad" prior art surveillance
system, sixteen cameras are linked to a single monitor whose
screen displays, continuously and simultaneously, all
sixteen different images. These improvements allow fewer
personnel to adequately supervise the monitors covering the
larger protected area.

These improvements, however, still require the
constant attention of at least one person. The above described
multiple-image/single-screen systems suffer from poor
resolution and complex viewing. The reliability of the entire
system is still dependent on the alertness and efficiency of
the security personnel watching the monitors. The personnel
watching the monitors are still burdened with identifying an
abnormal act or condition shown on one of the monitors,
determining which camera, and which corresponding zone of the
protected area, is recording the abnormal event, determining the
level of concern placed on the particular event, and finally,
determining the appropriate actions that must be taken to
respond to the particular event.

Eventually, it was recognized that human personnel
could not reliably monitor the "real-time" images from one or
several cameras for long "watch" periods of time. It is
natural for any person to become bored while performing a
monotonous task, such as staring at one or several monitors
continuously, waiting for something unusual or abnormal to
occur; something which may never occur.

As discussed above, it is the human link which lowers
the overall reliability of the entire surveillance system.
U.S. Patent 4,737,847 issued to Araki et al. discloses an
improved abnormality surveillance system wherein motion sensors
are positioned within a protected area to first determine the
presence of an object of interest, such as an intruder. In the
system disclosed by U.S. Patent 4,737,847, zones having
prescribed "warning levels" are defined within the protected
area. The zone in which an object or person is detected, the
zone to which it moves, and the length of time the detected
object or person remains in a particular zone together determine
whether the object or person entering the zone should be
considered an abnormal event or a threat.

The surveillance system disclosed in U.S. Patent
4,737,847 does remove some of the monitoring responsibility
otherwise placed on human personnel. However, such a system can
only determine an intruder's "intent" by his presence relative
to particular zones. The actual movements and sounds of the
intruder are not measured or observed. A skilled criminal
could easily determine the warning levels of obvious zones
within a protected area and act accordingly, for example by
spending little time in zones having a high warning level.

It is therefore an object of the present invention
to provide a surveillance system which overcomes the problems
of the prior art.

It is another object of the invention to provide such 
a surveillance system wherein a potentially abnormal event is 
determined by a computer prior to summoning a human supervisor. 

It is another object of the invention to provide a
surveillance system which compares specific measured movements
of a particular person or persons with a trainable,
predetermined set of "typical" movements to determine the level
and type of criminal or mischievous event.

It is another object of this invention to provide a
surveillance system which transmits the data from various
sensors to a location where it can be recorded for evidentiary
purposes.

It is another object of this invention to
provide such a surveillance system which is operational day and
night.

It is another object of this invention to provide a
surveillance system which can cull out real-time events which
indicate criminal intent using a weapon, by resolving the low
temperature of the weapon relative to the higher body
temperature and by recognizing the stances taken by the person
with the weapon.

It is yet another object of this invention to provide
a surveillance system which does not require "real time"
observation by human personnel.

Incorporated by Reference 

The content of the following references is hereby 
incorporated by reference. 

1. Motz, L. and L. Bergstein, "Zoom Lens Systems",
Journal of Optical Society of America, 3 papers in Vol. 52,
1992.

2. Aviv, D.G., "Sensor Software Assessment of
Advanced Earth Resources Satellite Systems", ARC Inc. Report
870-80-A, pp. 2-107 through 2-119; NASA contract NAS-1-16366.

3. Shio, A. and J. Sklansky, "Segmentation of People
in Motion", Proc. of IEEE Workshop on Visual Motion, Princeton,
NJ, October 1991.

4. Agarwal, R. and J. Sklansky, "Estimating Optical
Flow from Clustered Trajectory Velocity Time".

5. Suzuki, S. and J. Sklansky, "Extracting Non-Rigid
Moving Objects by Temporal Edges", IEEE, 1992, Transactions of
Pattern Recognition.

6. Rabiner, L. and Biing-Hwang Juang, "Fundamentals
of Speech Recognition", Pub. Prentice Hall, 1993 (pp. 434-495).

7. Waibel, A. and Kai-Fu Lee, Eds., "Readings in
Speech Recognition", Pub. Morgan Kaufmann, 1990 (pp. 267-296).

8. Rabiner, L., "Speech Recognition and Speech
Synthesis Systems", Proc. IEEE, January 1994.

Summary of the Invention

A surveillance system having at least one primary 
video camera for translating real images of a zone into 
electronic video signals at a first level of resolution; 

means for sampling movements of an individual or
individuals located within the zone from the video signal
output from at least one video camera;

means for electronically comparing the video signals
of sampled movements of the individual with known
characteristics of movements which are indicative of
individuals having a criminal intent;

means for determining the level of criminal intent
of the individual or individuals; and

means for activating at least one secondary sensor
and associated recording device having a second, higher level
of resolution, said activating means being in response to
determining that the individual has a predetermined level of
criminal intent.

Brief Description of the Drawings 

Figure 1 is a schematic block diagram of the video, 
analysis, control, alarm and recording subsystems embodying 
this invention; 

Figure 2 is an array of frames describing how the
process of segmentation and location of the different objects
(people) in the image raster will identify interaction between
people;

Figure 3 is an array of frames indicating the "two
on one" interaction, wherein two objects (people) accost a
third object (person); and

Figure 4 is a schematic block diagram of a 
conventional word recognition system. 

Detailed Description of the Preferred Embodiments

Referring to Fig. 1, the basic elements of one
embodiment of the invention are illustrated, including picture
input means 10, which may be any conventional electronic
picture pickup device operational within the infrared or visual
spectrum (or both), such as a vidicon or a CCD/TV camera
(including the wireless type).

In another embodiment of picture input means 10,
a high rate camera/recorder is deployed, capable
of recording up to 200 frames/sec (similar to those made by NAC
Visual Systems of Woodland Hills, CA, SONY and others). Such
high rate camera/recorder systems are able to detect and record
very rapid movements of body parts that are commonly indicative
of a criminal intent. Such fast movements are typically not
resolved with a more standard 30 frames per second camera.

This picture input means may also be triggered by an alert
signal from the processor of the low resolution camera or
from the audio/word recognition processor when sensing
a suspicious event.

In this first embodiment, the primary picture input
means 10 is preferably a low cost video camera; high
resolution is not necessary, and because of the relative expense
the camera will most likely provide only moderate resolution. (The
preferred CCD/TV camera is about 1 1/2 inches in length and
about 1 inch in diameter, weighing about 3 ounces, and for
particular deployment, a zoom lens attachment may be used.)
This device will be operating continuously and will translate
the field of view ("real") images within a first observation
area into conventional video electronic signals.

In another embodiment of picture input means 10, a
high rate camera/recorder of up to 300 frames/sec (similar to
those made by NAC Visual Systems of Woodland Hills, CA, SONY
and others) is used, which enables the detection and recording of
even the very rapid movements of body parts that are indicative
of criminal intent. The more commonly
used camera operates at 30 frames per second and cannot capture
quick body movement with sufficient resolution.
The picture input means may also be activated by an
"alert" signal from the processor of the low resolution camera
or from the audio/word recognition processor when sensing a
suspicious event.

The picture input means for any embodiment contains
a preprocessor which normalizes a wide range of illumination
levels, especially for outside observation. The preprocessor
emulates a vertebrate's retina, which has an efficient
and accurate normalization process. One such preprocessor
(VLSI retina chip) is fabricated by the Carver Mead Laboratory
of the California Institute of Technology in Pasadena,
California. Use of this particular preprocessor chip will
increase the automated vision capability of this invention
whenever variation of light intensity and light reflection may
otherwise weaken the picture resolution.

The signals from the picture input means 10 are
converted into digitized signals and then sent to the picture
processing means 12.

The processor controlling each group of cameras will 
be governed by an artificial intelligence system, based on 
dynamic pattern recognition principles, as further described 
below. 

The picture processing means 12 includes an image
raster analyzer which effectively segments each image to
isolate each pair of people.

The image raster analyzer subsystem of picture
processing means 12 segments each sampled image to identify and
isolate each pair of objects (or people), and each "two on one"
group of 3 people, separately.

The "2 on 1" represents a common mugging situation
in which two individuals approach a victim: one from in front
of the victim and the other from behind. The forward mugger
tells the potential victim that if he does not give up his
money (or watch, ring, etc.), the second mugger will shoot,
stab or otherwise harm him. The group of three people will
thus be considered a potential crime in progress and will
therefore be segmented and analyzed in picture processing
means 12.

An additional embodiment of the picture input means 10 is
the inclusion of an optics system known as the zoom lens
system. The essentials of the zoom lens subsystem are
described in three papers written by L. Motz and L. Bergstein,
under the title "Zoom Lens Systems", in the Journal of
Optical Society of America, Vol. 52, April, 1992. These papers
are hereby incorporated by reference.

The essence of the zoom system is to vary the focal
length such that an object being observed will be focused and
magnified at its image plane. In an automatic version of the
zoom system, once an object is in the camera's field of view
(FOV), the lens moves to focus the object onto the
camera's image plane. An error signal, which is used to correct the
focus, is generated by segmenting the CCD array at the image plane into 2
halves and measuring the difference between the halves until
the object is at the center. Dividing the CCD array into more
than 2 segments, say 4 quadrants, is a way to achieve automatic
centering, as is the case with monopulse radar. Regardless
of the number of segments, the error signal is used to generate
the desired tracking of the object.
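
The segment-difference error signal described above can be sketched as follows. This is a minimal illustration only, assuming the CCD output is available as a 2-D grid of intensity values; the function name `quadrant_error` and the normalization by total intensity are assumptions made for the example and are not taken from this specification.

```python
def quadrant_error(image):
    """Return (horizontal, vertical) centering errors in the range [-1, 1].

    `image` is a 2-D list of non-negative intensities (hypothetical CCD
    readout). Each error is the normalized difference between the summed
    intensity of two opposite halves of the array; both errors are zero
    when the bright object is balanced about the center.
    """
    rows, cols = len(image), len(image[0])
    mid_row, mid_col = rows // 2, cols // 2
    left = right = top = bottom = 0.0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if c < mid_col:
                left += value
            else:
                right += value
            if r < mid_row:
                top += value
            else:
                bottom += value
    total = left + right
    if total == 0:
        return 0.0, 0.0  # nothing in view, so no correction is produced
    return (right - left) / total, (bottom - top) / total


# Object in the upper-right corner: errors point right and up.
corner = [[0, 0, 0, 9], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(quadrant_error(corner))
```

A tracking loop would drive the pan and tilt (or the lens) in the direction that reduces both errors toward zero.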

In wide field-of-view (WFOV) operation, there may
be more than one object; thus special attention is given to the
design of the zoom system and its associated software and
firmware control. Assuming 3 objects, as in the "2 on 1"
potential mugging threat described above, and that the 3
persons are all in one plane, one can program a shifting from
one object to the next, from one face to another face, in a
prescribed sequential order. Moreover, as the objects move
within the WFOV they will be automatically tracked in azimuth
and elevation. In principle, the zoom would focus on the
nearest object, assuming that the amount of light on each
object is the same, so that the prescribed sequence starting
from the closest object will proceed to the remaining objects
from, for example, right to left.

However, when the 3 objects are located in different
planes, but still within the camera's WFOV, the zoom, with
input from the segmentation subsystem of the picture analysis
means 12, will focus on the object closest to the right hand
side of the image plane, and then proceed to move the focus to
the left, focusing on the next object and on the next
sequentially.

In all of the above cases, the automatic zoom can
more naturally choose to home in on the person with the
brightest emission or reflection, and then proceed to the next
brightest and so forth. This would be a form of an
intensity/time selection multiplex zoom system.

The relative positioning of the input camera with
respect to the area under surveillance will affect the accuracy
with which the image raster analyzer segments each image. In
this preferred embodiment, it is beneficial for the input
camera to view the area under surveillance from a point located
directly above, e.g., with the input camera mounted high on a
wall, a utility tower, or a traffic light support tower. The
height of the input camera is preferably sufficient to minimize
occlusion between the input camera and the movement of the
individuals under surveillance.

Once the objects within each sampled video frame are
segmented (i.e., detected and isolated), an analysis is made
of the detailed movements of each object located within each
particular segment of each image, and their relative movements
with respect to the other objects.

Each image frame segment, once digitized, is stored
in a frame by frame memory storage of section 12. Each frame
from the camera input 10 is subtracted from a previous frame
already stored in memory 12 using any conventional differencing
process. The differencing process, involving multiple
differencing steps, takes place in the differencing section 12.
The resulting difference signal (output from the
differencing sub-section 12) of each image indicates all the
changes that have occurred from one frame to the next. These
changes include any movements of the individuals located within
the segment and any movements of their limbs, e.g., arms.
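
As an illustration of the frame-to-frame differencing just described, the sketch below subtracts a stored frame from the current one and marks the pixels that changed. It assumes grayscale frames held as 2-D lists of integers; the function names and the `threshold` parameter are hypothetical, chosen for the example, and stand in for "any conventional differencing process".

```python
def difference_frame(previous, current, threshold=10):
    """Return a binary change mask: 1 where |current - previous| > threshold."""
    return [
        [1 if abs(cur - prev) > threshold else 0
         for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def changed_centroid(mask):
    """Return the (row, column) centroid of changed pixels, or None if none changed."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, flag in enumerate(row) if flag]
    if not points:
        return None
    count = len(points)
    return (sum(r for r, _ in points) / count,
            sum(c for _, c in points) / count)


# A bright spot moves one pixel to the right between two frames.
frame_k = [[0, 0, 0], [0, 100, 0], [0, 0, 0]]
frame_k1 = [[0, 0, 0], [0, 0, 100], [0, 0, 0]]
mask = difference_frame(frame_k, frame_k1)
print(mask, changed_centroid(mask))
```

The centroid of the changed region, collected over successive frames, is one simple way to obtain the per-object position samples that later stages analyze.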

A collection of differencing signals for each moved
object over subsequent sampled frames of images (called a
"track") allows a determination of the type, speed and
direction (vector) of each motion involved. Further processing
extracts acceleration, i.e., the rate of change of
velocity, and the change in acceleration with respect to time
(called "jerkiness"), which are then correlated with stored
signatures of known physical criminal acts. For example,
subsequent differencing signals may reveal that an individual's
arm is moving to a high position (such as the upper limit of
that arm's motion, i.e., above his head) at a fast speed. This
particular movement could be perceived, as described below, as
a hostile movement with a possible criminal intent requiring
the expert analysis of security personnel.
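
The velocity, acceleration and "jerkiness" of a track can be approximated by repeated finite differences of the sampled positions. The following is a minimal sketch under stated assumptions: a track is taken to be a list of (x, y) positions sampled at a constant frame rate, and the names `derivative`, `track_kinematics` and `fps` are illustrative rather than part of the described system.

```python
import math


def derivative(samples, dt):
    """First-order finite difference of equally spaced samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


def track_kinematics(positions, fps=30.0):
    """Return speed, acceleration and jerk magnitudes along a track.

    `positions` is a list of (x, y) tuples, one per analyzed frame.
    Each successive derivative is one sample shorter than the last.
    """
    dt = 1.0 / fps
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    vx, vy = derivative(xs, dt), derivative(ys, dt)
    ax, ay = derivative(vx, dt), derivative(vy, dt)
    jx, jy = derivative(ax, dt), derivative(ay, dt)

    def magnitude(us, vs):
        return [math.hypot(u, v) for u, v in zip(us, vs)]

    return {
        "speed": magnitude(vx, vy),
        "acceleration": magnitude(ax, ay),
        "jerk": magnitude(jx, jy),
    }


# Constant acceleration along x: the jerk should be zero.
track = [(0, 0), (1, 0), (4, 0), (9, 0), (16, 0)]
print(track_kinematics(track, fps=1.0))
```

In practice the raw differences are noisy, so some smoothing before differentiation would likely be needed; that step is omitted here.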

The intersection of two tracks indicates the
intersection of two moved objects. The intersecting objects,
in this case, could be merely the two hands of two people
greeting each other, or depending on other characteristics, as
described below, the intersecting objects could be interpreted
as a fist of an assailant contacting the face of a victim in
a less friendly greeting. In any event, the intersection of
two tracks immediately requires further analysis and/or the
summoning of security personnel. The generation of an
alarm by light and sound devices located, for example, on a
monitor will turn a guard's attention only to that monitor;
hence the labor savings. In general, however, friendly
interactions between individuals are a much slower physical
process than a physical assault, with respect to the body parts
of the individuals involved. Hence, friendly interactions may be
easily distinguished from hostile physical acts using current
low pass and high pass filters, and current pattern recognition
techniques based on experimental reference data.
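
A simple way to picture the track-intersection test, and the slow-versus-fast distinction drawn above, is sketched below. It is an assumption-laden illustration: the two tracks are taken to be sampled on the same frames, contact is declared when the objects come within a hypothetical `contact_radius`, and a single speed threshold stands in for the filtering and pattern recognition techniques the text refers to.

```python
import math


def separation(point_a, point_b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


def first_intersection(track_a, track_b, contact_radius=5.0):
    """Return the first frame index at which the two tracks meet, or None."""
    for frame, (a, b) in enumerate(zip(track_a, track_b)):
        if separation(a, b) <= contact_radius:
            return frame
    return None


def closing_speed(track_a, track_b, frame, fps=30.0):
    """Rate at which the separation shrank over the frame before contact."""
    if frame == 0:
        return 0.0
    before = separation(track_a[frame - 1], track_b[frame - 1])
    now = separation(track_a[frame], track_b[frame])
    return (before - now) * fps


def classify_contact(track_a, track_b, fast_threshold, fps=30.0):
    """Label an intersection 'fast' or 'slow'; return None if the tracks never meet."""
    frame = first_intersection(track_a, track_b)
    if frame is None:
        return None
    speed = closing_speed(track_a, track_b, frame, fps)
    return "fast" if speed > fast_threshold else "slow"


hand = [(0, 0), (10, 0), (20, 0)]
face = [(40, 0), (30, 0), (22, 0)]
print(first_intersection(hand, face), classify_contact(hand, face, 10.0, fps=1.0))
```

A "fast" contact would only prompt further analysis; it is not by itself evidence of an assault.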

When a large number of sensors (called a sensor
suite) are distributed over a large number of facilities, for
example, a number of ATMs (automatic teller machines)
associated with particular bank branches in a particular
state or states, and all operated under a single bank network
control, then only one monitor is required.

A commercially available software tool may enhance
object-movement analysis between frames (called optical flow
computation) (see refs. 3 and 4). With optical flow
computation, specific (usually bright) reflective elements,
called farkles, emitted from the clothing and/or the body parts
of an individual in one frame are subtracted from those of a previous
frame. The bright portions will inherently provide sharper
detail and therefore will yield more accurate data regarding
the velocities of the relative moving objects. Additional
computation, as described below, will provide data regarding
the acceleration and even change in acceleration or "jerkiness"
of each moving part sampled.

The physical motions of the individuals involved in
an interaction will be detected by first determining the edges
of each person imaged. The movements of the body
parts will then be observed by noting the movements of the
edges of the body parts of the (2 or 3) individuals involved
in the interaction.

The differencing process will enable the
determination of the velocity, the acceleration and the rate of
change of acceleration of those body parts.

The now processed signal is sent to comparison means
14, which compares selected frames of the video signals from the
picture input means 10 with "signature" video signals stored
in memory 16. The signature signals are representative of
various positions and movements of the body parts of an
individual having various levels of criminal intent. The
method for obtaining the data base of these signature video
signals in accordance with another aspect of the invention is
described in greater detail below.
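
The comparison of measured motion against stored signatures can be pictured with the following sketch. It is not the comparison method of comparison means 14, which the text does not specify in detail; it assumes, purely for illustration, that each movement is reduced to a tuple of (peak speed, peak acceleration, peak jerk) and that a match means every component lies within a relative tolerance of the stored value. The signature names and numbers are invented placeholders.

```python
def matches_signature(features, signature, tolerance=0.25):
    """True if every feature is within `tolerance` (relative) of the signature."""
    return all(abs(f - s) <= tolerance * abs(s)
               for f, s in zip(features, signature))


def compare_with_signatures(features, signatures, tolerance=0.25):
    """Return the names of all stored signatures that the features match."""
    return [name for name, signature in signatures.items()
            if matches_signature(features, signature, tolerance)]


# Hypothetical signature store: (peak speed, peak acceleration, peak jerk).
SIGNATURES = {
    "raised_arm_strike": (4.0, 30.0, 200.0),
    "push": (2.0, 10.0, 50.0),
}

measured = (3.8, 28.0, 210.0)
print(compare_with_signatures(measured, SIGNATURES))
```

A non-empty result would correspond to a positive comparison and hence to an "alert" output.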

If a comparison is positive with one or more of
the signature video signals, an output "alert" signal is sent
from the comparison means 14 to a controller 18. The
controller 18 controls the operation of a secondary, high
resolution picture input means (video camera) 20 and a
conventional monitor 22 and video recorder 24. The field of
view of the secondary camera 20 is preferably, at most, the same
as the field of view of the primary camera 10, surveying a
second observation area. The recorder 24 may be located at the
site and/or at both a law enforcement facility (not shown) and,
simultaneously, at a Court office or legal facility to prevent
loss of incriminating information due to tampering.

The purpose of the secondary camera 20 is to provide
a detailed video signal of the individual having assumed
criminal intent and also to improve false positive and false
negative performance. This information is recorded by the
video recorder 24 and displayed on a monitor 22. An alarm bell
or light (not shown) or both may be provided and activated by
an output signal from the controller 18 to summon a supervisor
to immediately view the pertinent video images showing the
apparent crime in progress and assess its accuracy.

In still another embodiment of the invention, a VCR
26 is operating continuously (using a 6 hour loop-tape, for
example). The VCR 26 is being controlled by the VCR controller
28. All the "real-time" images directly from the picture input
means 10 are immediately recorded and stored for at least 6
hours, for example. Should it be determined that a crime is
in progress, a signal from the controller 18 is sent to the VCR
controller 28, changing the mode of recording from tape looping
mode to non-looping mode. Once the VCR 26 is changed to a
non-looping mode, the tape will not re-loop and will therefore
retain the perhaps vital recorded video information of the
surveyed site, including the crime itself, and the events
leading up to the crime.
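
The looping and non-looping recording modes can be modeled in software with a bounded buffer, as sketched below. This is an analogy only, under the assumption that frames are stored digitally rather than on tape; the class name `LoopRecorder` and its methods are hypothetical.

```python
from collections import deque


class LoopRecorder:
    """Keeps the most recent frames until looping is switched off."""

    def __init__(self, loop_capacity):
        # While looping, the oldest frame is overwritten once capacity is reached.
        self._loop = deque(maxlen=loop_capacity)
        self._retained = None  # becomes a list once looping stops

    def record(self, frame):
        if self._retained is None:
            self._loop.append(frame)
        else:
            self._retained.append(frame)  # nothing is overwritten any more

    def stop_looping(self):
        """Switch to non-looping mode, preserving everything recorded so far."""
        if self._retained is None:
            self._retained = list(self._loop)

    def frames(self):
        source = self._loop if self._retained is None else self._retained
        return list(source)


recorder = LoopRecorder(loop_capacity=3)
for frame_number in range(5):
    recorder.record(frame_number)
recorder.stop_looping()   # alert received: keep the lead-up and what follows
recorder.record(5)
print(recorder.frames())
```

The retained buffer then holds both the events leading up to the alert and everything recorded afterwards.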

When the non-looping mode is initiated, the video
signal may also be transmitted to a VCR located elsewhere, for
example, at a law enforcement facility and, simultaneously, to
other secure locations of the Court and its associated offices.

Prior to the video signals being compared with the
"signature" signals stored in memory, each sampled frame of
video is "segmented" into parts relating to the objects
detected therein. To segment a video signal, the video signal
derived from the vidicon or CCD/TV camera is analyzed by an
image raster analyzer. Although this process causes slight
signal delays, it is accomplished nearly in real time.

At certain sites, or in certain situations, a high
resolution camera may not be required or otherwise used. For
example, the resolution provided by a relatively simple and low
cost camera may be sufficient. Depending on the level of
security for the particular location being surveyed, and the
time of day, the length of frame intervals between analyzed
frames may vary. For example, in a high risk area, every frame
from the CCD/TV camera may be analyzed continuously to ensure
that the maximum amount of information is recorded prior to and
during a crime. In a low risk area, it may be preferred to
sample perhaps every 10 frames from each camera, sequentially.
If, during such a sampling, it is determined that an abnormal
or suspicious event is occurring, such as two people moving
very close to each other, then the system would activate an
alert mode wherein the system becomes "concerned and curious"
about the suspicious actions and the sampling rate is increased
to perhaps every 5 frames or even every frame. As described
in greater detail below, depending on the type of system
employed (i.e., video only, audio only or both), during such
an alert mode, the entire system may be activated wherein both
audio and video systems begin to sample the environment for
sufficient information to determine the intent of the actions.
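
The variable sampling interval described above can be sketched as a small scheduling function. The intervals of 10, 5 and 1 frames follow the figures given in the text; the function names, the `alert_level` scale and the `alert_start` parameter are assumptions introduced for the example.

```python
def sampling_interval(high_risk, alert_level):
    """Number of frames between analyzed frames.

    alert_level 0 is normal operation, 1 is the "concerned and curious"
    mode, and 2 or more means every frame is analyzed.
    """
    if high_risk or alert_level >= 2:
        return 1
    if alert_level == 1:
        return 5
    return 10


def frames_to_analyze(total_frames, high_risk, alert_start=None):
    """List the frame indices that would be analyzed.

    If `alert_start` is given, the alert level rises to 1 at the first
    analyzed frame at or after that index (a hypothetical trigger).
    """
    selected = []
    frame = 0
    while frame < total_frames:
        selected.append(frame)
        level = 1 if alert_start is not None and frame >= alert_start else 0
        frame += sampling_interval(high_risk, level)
    return selected


print(frames_to_analyze(40, high_risk=False))
print(frames_to_analyze(40, high_risk=False, alert_start=20))
```

Note that a suspicious event can only be noticed on an analyzed frame, so the normal interval bounds how late the alert mode may begin.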

Referring to Fig. 2, several frames of a particular
camera output are shown to illustrate the segmentation process
performed in accordance with the invention. The system begins
to sample at frame K and determines that there are four objects
(previously determined to be people, as described below), A-D,
located within a particular zone being policed. Since nothing
unusual is determined from the initial analysis, an "alert"
status is not warranted. People A, B, and D are
moving according to normal, non-criminal intent, as could be
observed.

A crime likelihood is indicated when frames K+10
through K+13 are analyzed by the differencing process. If
the movements of the body parts indicate velocity, acceleration
and "jerkiness" that compare positively with the stored digital
signals depicting movements of known criminal physical
assaults, it is likely that a crime is in progress.

Additionally, if a large velocity of departure is
indicated when person C moves away from person B, as indicated
in frames K+15 through K+17, a higher level of confidence is
attained in deciding that a physical criminal act has taken
place or is about to.

An alarm is generated the instant any of the above
conditions is established. This alarm condition will result
in police or guards being sent to the crime site, in
the high resolution CCD/TV camera being activated to record the
face of the person committing the assault, and in a loudspeaker
being activated automatically to play a recorded announcement
warning the perpetrator of the seriousness of the actions now
being undertaken and demanding that he cease the criminal act.
After dark, a strong light will be turned on automatically. The
automated responses will be actuated the instant an alarm
condition is adjudicated by the processor. Furthermore, an alarm
signal is sent to the police station, and the same video signal
of the event is transmitted to a court appointed data collection
office, to the Public Defender's office and to the District
Attorney's Office.
As described above, it is necessary to compare the
resulting signature of the motion of physical body parts involved in
a physical criminal act, which is expressed by specific motion
characteristics (i.e., velocity, acceleration, change of
acceleration), with a set of signature files of physical
criminal acts in which body part motions are equally involved.
This comparison is commonly referred to as pattern matching
and is part of the pattern recognition process.

The files of physical criminal acts, which involve
movements of body parts such as hands, arms, elbows, shoulders,
head, torso, legs, and feet, are obtained, a priori, by
experiments and simulations of physical criminal acts gathered
from "dramas" that are enacted by professional actors. Data
gathered from experienced muggers who have been caught by the
police, as well as from victims who have reported details of their
experiences, will help the actors perform accurately. Video of
the motions involved in these simulated acts will be stored
in digitized form, and files will be prepared for each of the
body parts involved in the simulated physical criminal acts.

The present invention could be easily implemented at
various sites to create effective "Crime Free" zones. In
another embodiment, the above described Abnormality Detection
System includes an RF-ID (Radio Frequency Identification) tag,
to assist in the detection and tracking of individuals within
the field of view of a camera.
I.D. cards or tags are worn by authorized
individuals. The tags respond when queried by the RF
Interrogator. The response signal of the tags has a propagation
pattern which is adequately registered with the video sensor.
Individuals whose tags are sensed in the video are assumed
friendly and authorized. This information will simplify the
segmentation process.

A light connected to each RF-ID card will be turned
ON when a positive response to an interrogation signal is
established. The light will appear on the computer generated
grid (also on the screen of the monitor), and the intersection
of tracks will be clearly indicated, followed by their physical
interaction. Also noted will be the intersection between
the tagged and the untagged individuals. In all such cases,
the segmentation process will be simpler.

There are many manufacturers of RF-ID cards and
Interrogators; three major ones are The David Sarnoff Research
Center of Princeton, New Jersey, AMTECH of Dallas, Texas and
MICRON Technology of Boise, Idaho.

The applications of the present invention include
stationary facilities: banks and ATMs, hotels, private
residence halls and dormitories, high rise and low rise office
and residential buildings, public and private schools from
kindergarten through high school, colleges and universities,
hospitals, sidewalks, street crossings, parks, containers and
container loading areas, shipping piers, train stations, truck
loading stations, airport passenger and freight facilities, bus
stations, subway stations, movie houses, theaters, concert halls
and arenas, sport arenas, libraries, churches, museums, stores,
shopping malls, restaurants, convenience stores, bars, coffee
shops, gasoline stations, highway rest stops, tunnels, bridges,
gateways, sections of highways, toll booths, warehouses and
depots, factories and assembly rooms, and law enforcement
facilities including jails.

Further applications of the invention include areas of moving
platforms: automobiles, trucks, buses, subway cars, train cars
(freight and passenger), boats and ships (passenger and
freight), tankers, service vehicles, construction vehicles
(on- and off-road), containers and their carriers, and
airplanes. Military applications will include, but will not be
limited to, assorted military ground, sea, and air mobile
vehicles and platforms, as well as stationary facilities where
the protection of low, medium, and high value targets is
necessary; such targets are common in the military but have
equivalents in the civilian areas, so this invention will
serve both sectors.

As a deterrent to car-jacking, a tiny CCD/TV camera, connected
surreptitiously at the ceiling of the car or in the rear-view
mirror, through a pin hole lens and focused at the driver's
seat, will be connected to the video processor to record the
face of the driver. The camera is triggered by the automatic
word recognition processor, which will identify the well known
expressions commonly used by the car-jacker. The video picture
will be recorded and then transmitted via the cellular phone
in the car. Without a phone, the short video recording of the
face of the car-jacker will be held until the car is found by
the police, who will then have the evidence (the picture of
the car-jacker) in hand.

In this present surveillance system, the security personnel
manning the monitors are alerted only to video images which
show suspicious actions (criminal activities) within a
prescribed observation zone. The security personnel are
therefore used to assess the accuracy of the detected crime
and determine the necessary actions for an appropriate
response. By using computers to effectively filter out all
normal and noncriminal video signals from observation areas,
fewer security personnel are required to survey and "secure" a
greater overall area (including a greater number of
observation areas, i.e., cameras).

It is also contemplated that the present system could be
applied to help blind people "see". A battery operated
portable version of the video system would automatically
identify known objects in its field of view and a speech
synthesizer would "say" the object. For example, "chair",
"table", etc. would indicate the presence of a chair and a
table.

Depending on the area to be policed, it is preferable that at
least two and perhaps three cameras (or video sensors) are
used simultaneously to cover the area. Should one camera sense
a first level of criminal action, the other two could be
manipulated to provide a three dimensional perspective
coverage of the action. The three dimensional image of a
physical interaction in the policed area would allow
observation of a greater number of details associated with the
steps: accost, threat, assault, response and post response.
The conversion from the two dimensional image to the three
dimensional image is known as "random transform".

In the extended operation phase of the invention, more details
will be accumulated of the physical variation of movement
characteristics of physical threats and assaults against a
victim, and also of the speaker independent (male and female,
of different age groups) and dialect independent words and
terse sentences, with corresponding responses. These details
will enable automatic recognition of a criminal assault
without the need of a guard, unless one is required by
statutes and other external requirements.

In another embodiment of the present invention, both video and
acoustic information is sampled and analyzed. The acoustic
information is sampled and analyzed in a similar manner to the
sampling and analyzing of the above-described video
information. The audio information is sampled and analyzed in
a manner shown in Fig. 4, and is based on prior art
(references 6 and 7).

The employment of the audio speech band, with its associated
Automatic Speech Recognition (ASR) system, will not only
reduce the false alarm rate resulting from the video analysis,
but can also be used to trigger the video and other sensors if
the sound threat predates the observed threat.

Referring to Fig. 4, a conventional automatic word recognition
system is shown, including an input microphone system 40, an
analysis subsystem 42, a template subsystem 44, a pattern
comparator 46, and a post-processor and decision logic
subsystem 48.

In operation, upon activation, the acoustic/audio policing
system will begin sampling all (or a selected portion) of
nearby acoustic signals. The acoustic signals will include
voices and background noise. The background noise signals are
generally known and predictable, and may therefore be easily
filtered out using conventional filtering techniques. Among
the expected noise signals are unfamiliar speech, automotive
related sounds, honking, sirens, and the sound of wind and/or
rain.

The microphone input system 40 picks up the acoustic signals,
immediately filters out the predictable background noise
signals and amplifies the remaining recognizable acoustic
signals. The filtered acoustic signals are analyzed in the
analysis subsystem 42, which processes the signals by means of
digital and spectral analysis techniques. The output of the
analysis subsystem is compared in the pattern comparator
subsystem 46 with selected predetermined words stored in
memory in the template subsystem 44. The post processing and
decision logic subsystem 48 generates an alarm signal, as
described below.
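The flow through subsystems 42, 44, 46 and 48 can be sketched in a few lines. This is a deliberately simplified illustration, not the recognizer of Fig. 4: it assumes per-frame RMS energy as the only feature and a plain Euclidean distance as the comparison, whereas a practical system would use spectral features and time alignment. The function names, the `frame_len` parameter and the `threshold` value are hypothetical.

```python
import math
from typing import Dict, List, Optional


def analyze(samples: List[float], frame_len: int = 4) -> List[float]:
    """Stand-in for analysis subsystem 42: per-frame RMS energy."""
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    return [math.sqrt(sum(x * x for x in f) / frame_len) for f in frames]


def compare(features: List[float],
            templates: Dict[str, List[float]]) -> Dict[str, float]:
    """Stand-in for pattern comparator 46: distance to each template.

    Templates of a different length are skipped, because this sketch
    performs no time alignment.
    """
    return {
        word: math.dist(features, template)
        for word, template in templates.items()
        if len(template) == len(features)
    }


def decide(distances: Dict[str, float], threshold: float) -> Optional[str]:
    """Stand-in for decision logic 48: best match, or None if too far."""
    if not distances:
        return None
    best = min(distances, key=distances.get)
    return best if distances[best] <= threshold else None


# Hypothetical stored templates (template subsystem 44).
TEMPLATES = {"help": [1.0, 0.0], "money": [0.0, 1.0]}

samples = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
word = decide(compare(analyze(samples), TEMPLATES), threshold=0.5)
```

A returned word would correspond to the condition under which the alarm signal is generated; `None` means that no stored template was close enough.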

The templates 44 include perhaps about 100 brief and easily
recognizable terse expressions, some of which are single
words, and are commonly used by those intent on a criminal
act. Some examples of commonly used word phrases spoken by a
criminal to a victim prior to a mugging, for example, include:
"Give me your money", "This is a stick-up", "Give me your
wallet and you won't get hurt", etc. Furthermore, commonly
used replies from a typical victim during such a mugging may
also be stored as template words, such as "help", and certain
sounds such as shrieks, screams and groans, etc.

The specific word templates, with which input acoustic sounds
are compared, must be chosen carefully, taking into account
the particular accents and slang of the language spoken in the
region of concern (e.g., the southern cities of the U.S. will
require a different template 44 than the one used for a
recognition system in the New York City region of the U.S.).
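The idea of region-specific templates can be sketched as a lookup that combines a shared phrase list with a regional one. The sketch assumes that a recognizer has already produced a text transcript. The region keys and the regional phrases are hypothetical placeholders, not phrases taken from the specification; only the shared phrases are drawn from the examples given above.

```python
import re
from typing import Dict, List

# Shared templates, taken from the examples in the text.
BASE_TEMPLATES: List[str] = [
    "give me your money",
    "this is a stick-up",
    "help",
]

# Hypothetical regional additions (placeholders for local slang).
REGIONAL_TEMPLATES: Dict[str, List[str]] = {
    "region_a": ["hand over your wallet"],
    "region_b": ["gimme your wallet"],
}


def normalize(text: str) -> str:
    """Lower-case, drop punctuation except hyphens, collapse spaces."""
    return " ".join(re.sub(r"[^a-z0-9\- ]", " ", text.lower()).split())


def templates_for(region: str) -> List[str]:
    """Shared templates plus those for the given region, if any."""
    return BASE_TEMPLATES + REGIONAL_TEMPLATES.get(region, [])


def matched_templates(transcript: str, region: str) -> List[str]:
    """Return templates that occur as whole phrases in the transcript."""
    padded = f" {normalize(transcript)} "
    return [t for t in templates_for(region) if f" {t} " in padded]
```

Matching on whole phrases keeps a word such as "helpful" from triggering the "help" template.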

The output of the word recognition system shown in Fig. 4 is
used as a trigger signal to activate a sound recorder, or a
camera used elsewhere in the invention, as described below.

The preferred microphone used in the microphone input
subsystem 40 is a shot-gun microphone, such as those
commercially available from the Sennheiser Company of
Frankfurt, Germany. These microphones have a super-cardioid
propagation pattern. However, the gain of the pattern may be
too small for high traffic areas, which may therefore require
more than one microphone in an array configuration to
adequately focus and track in these areas. The propagation
pattern of the microphone system enables better focusing on a
moving sound source (e.g., a person walking and talking). A
conventional directional microphone, such as those made by the
Sony Corporation of Tokyo, Japan, may also be used in place of
a shot-gun type microphone. Such directional microphones will
achieve similar gain to the shot-gun type microphones, but
with a smaller physical structure.

A feedback loop circuit (not specifically shown) originating
in the post processing subsystem 48 will direct the microphone
system to track a particular dynamic source of sound within
the area surveyed by video cameras.

An override signal from the video portion of the present
invention will activate the microphone system and direct it
towards the field of view of the camera. In other words,
should the video system detect a potential crime in progress,
the video system will steer the audio recording system towards
the scene of interest. Likewise, should the audio system
detect words of an aggressive nature, as described above, the
audio system will direct appropriate video cameras to visually
cover and record the apparent source of the sound.
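The mutual cueing between the video and audio portions can be sketched as a small dispatcher. The sketch assumes, for illustration only, that each detection carries a bearing in degrees and that steering commands are simply queued; the `Detection` and `CueController` names and the command format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Detection:
    source: str         # "video" or "audio"
    bearing_deg: float  # direction of the detected event


@dataclass
class CueController:
    # Queued steering commands as (device, bearing in degrees).
    commands: List[Tuple[str, float]] = field(default_factory=list)

    def on_detection(self, detection: Detection) -> None:
        """Steer the other sensor type towards the detected event."""
        bearing = detection.bearing_deg % 360
        if detection.source == "video":
            # Video saw a potential crime: point the microphones at it.
            self.commands.append(("microphone", bearing))
        elif detection.source == "audio":
            # Audio heard aggressive words: point the cameras at them.
            self.commands.append(("camera", bearing))
        else:
            raise ValueError(f"unknown source: {detection.source}")
```

For example, a video detection at a bearing of 370 degrees queues the command `("microphone", 10.0)`.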

A number of companies have developed very accurate and
efficient, speaker independent word recognition systems based
on a hidden Markov model (HMM) in combination with an
artificial neural network (ANN). These companies include IBM
of Armonk, NY, AT&T Bell Laboratories, Kurzweil of Cambridge,
MA and Lernout and Hauspie of Belgium.

Put briefly, the HMM system uses probability statistics to
predict a particular spoken word following recognition of a
primary word unit, syllable or phoneme. For example, as the
word "money" is inputted into an HMM word recognition system,
the first recognized portion of the word is "mon...". The HMM
system immediately recognizes this word stem and determines
that the spoken word could be "MONday", "MONopoly", or
"MONey", etc. The resulting list of potential words is
considerably shorter than the entire list of all spoken words
of the English language. Therefore, the HMM system employed
with the present invention allows both the audio and video
systems to operate quickly and use HMM probability statistics
to predict future movements or words based on an early
recognition of initial movements and word stems.
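The narrowing effect of an early word stem can be illustrated with a short sketch. This is not an HMM: it is a simplified stand-in that filters a vocabulary by prefix and ranks the survivors by relative frequency, whereas a real HMM recognizer scores state sequences against acoustic observations. The vocabulary and its counts are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical vocabulary with usage counts.
VOCAB_COUNTS: Dict[str, int] = {
    "monday": 50,
    "money": 30,
    "monopoly": 5,
    "wallet": 40,
    "help": 60,
}


def candidates_for_stem(stem: str,
                        counts: Dict[str, int]) -> List[Tuple[str, float]]:
    """Return (word, probability) for words starting with the stem.

    Probabilities are relative frequencies among the matching words,
    listed from most to least likely. No match yields an empty list.
    """
    matches = {w: c for w, c in counts.items() if w.startswith(stem.lower())}
    total = sum(matches.values())
    ranked = sorted(matches.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, count / total) for word, count in ranked]


shortlist = candidates_for_stem("mon", VOCAB_COUNTS)
```

With this vocabulary, the stem "mon" reduces five candidate words to three, which is the kind of reduction the paragraph above describes.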

The HMM system may be equally employed in the video
recognition system. For example, if a person's arm quickly
moves above his head, the HMM system may determine that there
is a high probability that the arm will quickly come down,
perhaps indicating a criminal intent.
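The arm-movement example can be sketched as a table of transition probabilities between movement states. For simplicity this uses a plain first-order Markov chain over directly observed states rather than a hidden Markov model, and the state names and probabilities are hypothetical values chosen only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical transition probabilities between movement states.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "arm_at_side": {"arm_at_side": 0.9, "arm_raised_fast": 0.1},
    "arm_raised_fast": {"arm_swings_down": 0.7, "arm_lowered_slowly": 0.3},
    "arm_swings_down": {"arm_at_side": 1.0},
    "arm_lowered_slowly": {"arm_at_side": 1.0},
}


def most_likely_next(state: str,
                     transitions: Dict[str, Dict[str, float]]
                     ) -> Tuple[str, float]:
    """Return the most probable next state and its probability."""
    next_states = transitions[state]
    best = max(next_states, key=next_states.get)
    return best, next_states[best]


prediction = most_likely_next("arm_raised_fast", TRANSITIONS)
```

Under these assumed numbers, a quickly raised arm predicts a fast downward swing with probability 0.7, which a decision stage could treat as an early warning.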

What is claimed is:

1. A surveillance system, comprising:
at least one primary video camera for translating real images
of an area into electronic video signals at a first level of
resolution;
means for sampling the movements of an individual or
individuals located within the area from the video signal
output from said at least one primary video camera;
means for electronically comparing the video signals of said
movements of said sampling means of the individual with known
characteristics of movements which are indicative of
individuals having a criminal intent;
means for determining the level of criminal intent of the
individual or individuals, said determining means being
dependent on said electronic comparing means; and
means for generating a signal indicating a predetermined level
of criminal intent is present.

2. The surveillance system in accordance with claim 1, wherein
said signal generating means activates a video signal recorder
for recording said video signals from said at least one
primary camera.

3. The surveillance system in accordance with claim 1, wherein
said signal generating means activates a microphone for
receiving audible information of said individuals located in
said area.



4. The surveillance system in accordance with claim 1, wherein
said signal generating means activates at least one secondary
high resolution camera.